Last updated on 2017-03-08
Splines: Calculus and Linear Algebra at work.
The red curve is the LOESS smoothed curve:
intro_splines.R Solutions

Copy this at the end of intro_splines.R from the previous lecture:
# SOLUTIONS:
# 1. Note: Anything defined in the ggplot() command trickles down to all the sub
# geoms. If you want to use data/aes() specific to only one particular geom,
# you define it inside:
ggplot(data=NULL, aes(x=x)) +
  # What we're given - the observed y:
  geom_point(data=real_life_augmented, aes(y=y)) +
  # What we won't typically know, but here we pretend to - the f(x) i.e. the signal:
  geom_line(data=simulated, aes(y=f), col="red", size=1) +
  # Our guess at f(x), i.e. f_hat(x) - our fitted/predicted values:
  geom_line(data=real_life_augmented, aes(y=.fitted), col="blue", size=1)

# 2. Let's toy around with df. In particular df=2, 10, 50, 99
real_life_augmented <- smooth.spline(real_life$x, real_life$y, df=50) %>%
  augment() %>%
  tbl_df()
ggplot(data=NULL, aes(x=x)) +
  geom_point(data=real_life_augmented, aes(y=y)) +
  geom_line(data=real_life_augmented, aes(y=.fitted), col="blue", size=1) +
  geom_line(data=simulated, aes(y=f), col="red", size=1)
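To compare several df values side by side, one could fit one spline per candidate df and facet the results. A sketch only: the real_life data below is a made-up stand-in (substitute the real_life object from the lecture), and it extracts fitted values directly from the smooth.spline object rather than via augment():

```r
library(tidyverse)

# Toy data standing in for the lecture's real_life data frame (an assumption;
# substitute your own real_life object):
set.seed(76)
real_life <- data_frame(x = seq(0, 1, length = 100)) %>%
  mutate(f = sin(2 * pi * x), y = f + rnorm(100, sd = 0.3))

# Fit one smoothing spline per candidate df and stack the fitted curves.
# A smooth.spline fit stores its (unique, sorted) x values in $x and the
# corresponding fitted values in $y:
fits <- c(2, 10, 50, 99) %>%
  purrr::map_df(function(df_val) {
    fit <- smooth.spline(real_life$x, real_life$y, df = df_val)
    data_frame(x = fit$x, .fitted = fit$y, df = df_val)
  })

# One panel per df: low df underfits the red signal, high df chases the noise.
ggplot(data = NULL, aes(x = x)) +
  geom_point(data = real_life, aes(y = y)) +
  geom_line(data = fits, aes(y = .fitted), col = "blue", size = 1) +
  geom_line(data = real_life, aes(y = f), col = "red", size = 1) +
  facet_wrap(~df)
```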
What does df do?

To Chapter3_Lab.Rmd from last lecture (better viewed from the HTML document):
# Load packages:
library(tidyverse)
library(MASS)
library(ISLR)
library(broom)

# Fit model. Boston would be the training data in a Kaggle competition,
# i.e. we have the outcome variable median house value:
model_SL <- lm(medv~lstat, data=Boston)

# new_values would be the test data in a Kaggle competition,
# i.e. we don't have the outcome variable:
new_values <- data_frame(lstat=c(5,10,15))

# This is the MSE using the Boston training data:
augment(model_SL) %>%
  summarise(MSE=mean((medv-.fitted)^2))
# Or equivalently:
augment(model_SL) %>%
  summarise(MSE=mean((.resid)^2))

# This value does not exist, b/c we don't have the real y, so we can't compute
# the residuals:
augment(model_SL, newdata = new_values) %>%
  summarise(MSE=mean((.resid)^2))
# Fit model
model_full <- lm(medv~crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat,
                 data=Boston)
# See coefficients
tidy(model_full)
# This is the MSE using the Boston training data:
augment(model_full) %>%
  summarise(MSE=mean((.resid)^2))
# Fit model
model_polynomial <-
lm(medv~
crim+zn+indus+chas+nox+rm+age+dis+rad+tax+ptratio+black+lstat+
I(crim^2)+I(zn^2)+I(indus^2)+I(chas^2)+I(nox^2)+I(rm^2)+I(age^2)+I(dis^2)+I(rad^2)+I(tax^2)+I(ptratio^2)+I(black^2)+I(lstat^2),
data=Boston)
# See coefficients
tidy(model_polynomial)
# This is the MSE using the Boston training data:
augment(model_polynomial) %>%
summarise(MSE=mean((.resid)^2))
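Putting the three fits side by side makes the pattern explicit: because the models are nested, the training MSE can only decrease as flexibility grows. A sketch (not the lecture's code; the polynomial model here is abridged, adding only three squared terms rather than all thirteen):

```r
library(MASS)    # Boston data
library(broom)
library(dplyr)
library(purrr)

# Refit the three models (polynomial model abridged for brevity):
model_SL   <- lm(medv ~ lstat, data = Boston)
model_full <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis +
                   rad + tax + ptratio + black + lstat, data = Boston)
model_polynomial <- update(model_full, . ~ . + I(lstat^2) + I(rm^2) + I(crim^2))

# Training MSE for each model, in increasing order of flexibility:
mse <- list(simple = model_SL, full = model_full, polynomial = model_polynomial) %>%
  map_df(~ augment(.x) %>% summarise(MSE = mean((.resid)^2)), .id = "model")
mse
```

Remember this says nothing about performance on new data; that is what the test set (or cross-validation) is for.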
Today we do the following with a simple linear regression in R:
broom Package

As per the tidy tools manifesto, we introduce the broom package:
Compare:
From biology/medicine. Let
From economics/sociology. Let
Fig 5.1 p.177:
Fig 5.3 p.179: Leave-One-Out Cross-Validation
Fig 5.5 p.181: \(k=5\) Fold CV
Formalize what we mean by:
Fig 2.2 p.17:
Survived <- ifelse(Sex == "female", 1, 0)

\[Score = \frac{1}{n}\sum_{i=1}^{n}I(y_i = \widehat{y}_i)\]

where \(I(\cdot)\) is the indicator function: 1 when the prediction matches the true outcome, 0 otherwise.
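In other words, the score is just the proportion of correct predictions. A minimal, self-contained illustration with made-up vectors:

```r
# The score is the proportion of correct predictions:
# (1/n) * sum of I(y_i == y_hat_i).
# Toy vectors, made up purely for illustration:
y     <- c(0, 1, 1, 0, 1)   # true outcomes
y_hat <- c(0, 1, 0, 0, 0)   # predictions
mean(y == y_hat)            # 3 of 5 correct: 0.6
```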
train data (891 rows): used by you to train the model
test data (418 rows): used by Kaggle to evaluate/score/validate the model

Since test doesn't have the outcome variable Survived, we mimic the train/test split ourselves by building pseudo-train and pseudo-test sets via resampling from train, and scoring on pseudo-test.

Create disjoint pseudo_train and pseudo_test data sets using dplyr::anti_join:
library(tidyverse)
# You may need to change your directory path:
train <- readr::read_csv("assets/Titanic/train.csv")
pseudo_train <- train %>%
sample_frac(0.8)
pseudo_test <- train %>%
anti_join(pseudo_train, by="PassengerId")
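Because anti_join keeps only the rows of train whose PassengerId does not appear in pseudo_train, the two sets are disjoint and together recover all of train. A self-contained sanity check on a toy stand-in for train (on the real data, the same assertions hold):

```r
library(dplyr)

# Toy stand-in for the Titanic train data (just an id column, for illustration):
train <- data_frame(PassengerId = 1:10)

pseudo_train <- train %>% sample_frac(0.8)
pseudo_test  <- train %>% anti_join(pseudo_train, by = "PassengerId")

# Disjoint, and together they partition the original rows:
stopifnot(nrow(pseudo_train) + nrow(pseudo_test) == nrow(train))
stopifnot(length(intersect(pseudo_train$PassengerId,
                           pseudo_test$PassengerId)) == 0)
```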
See RStudio Menu Bar -> Help -> Cheatsheets -> Data Manipulation.
Compute your pseudo-score:
pseudo_test %>%
  # Create new column of predictions:
  mutate(Survived_predicted = ifelse(Sex == "female", 1, 0)) %>%
  # Compute score:
  summarize(Score = mean(Survived == Survived_predicted))
## # A tibble: 1 × 1
##       Score
##       <dbl>
## 1 0.7865169
Compare this to the Kaggle score of 0.76555. Why are they different?
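The ten scores below come from repeating the random 80/20 split. A sketch of that loop on a made-up stand-in for train (substitute the real Titanic train object; your numbers will differ from the lecture's since the seed there is unknown):

```r
library(tidyverse)

# Toy stand-in for train (made-up columns; use the real Titanic train instead):
set.seed(76)
train <- data_frame(
  PassengerId = 1:100,
  Sex         = sample(c("female", "male"), 100, replace = TRUE),
  Survived    = rbinom(100, 1, 0.4)
)

# Repeat the random 80/20 split and scoring ten times; the spread of the
# resulting scores is variability due to resampling alone:
scores <- replicate(10, {
  pseudo_train <- train %>% sample_frac(0.8)
  pseudo_test  <- train %>% anti_join(pseudo_train, by = "PassengerId")
  pseudo_test %>%
    mutate(Survived_predicted = ifelse(Sex == "female", 1, 0)) %>%
    summarize(Score = mean(Survived == Survived_predicted)) %>%
    .$Score
})
round(scores, 3)
mean(scores)
```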
## [1] 0.764 0.820 0.787 0.809 0.770 0.781 0.837 0.747 0.781 0.719
In this case, we have variability due to the resampling. The average of the 10 scores is 0.781.
R: tidyverse and the pipe operator %>%

tidyverse

tidyverse is a set of packages that work in harmony because they share common data representations and API design.
install.packages("tidyverse")
library(tidyverse)
Read more on RStudio's Blog.
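The pipe operator %>% mentioned above passes the left-hand side as the first argument of the right-hand side, so composed calls read left to right instead of inside-out. A small self-contained example:

```r
library(magrittr)  # %>% ; also re-exported when you load tidyverse packages

# Without the pipe, composed calls read inside-out:
round(sqrt(mean(c(1, 4, 9))), 2)

# With the pipe, the same computation reads left to right, one step at a time:
c(1, 4, 9) %>%
  mean() %>%
  sqrt() %>%
  round(2)
# Both return 2.16
```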
tidyverse Core Packages

These get loaded when you run library(tidyverse):
ggplot2 for data visualisation.
dplyr for data manipulation.
tidyr for data tidying.
readr for data import.
purrr for functional programming.
tibble for tibbles, a modern re-imagining of data frames.

tidyverse Non-Core Packages

These get installed with tidyverse:
hms for times.
stringr for strings.
lubridate for date/times.
forcats for factors.

tidyverse Data Import Packages

These get installed with tidyverse:
DBI for databases.
haven for SPSS, SAS and Stata files.
httr for web APIs.
jsonlite for JSON.
readxl for .xls and .xlsx files.
rvest for web scraping.
xml2 for XML.

tidyverse Modelling Packages

These get installed with tidyverse:
modelr for simple modelling within a pipeline.
broom for turning models into tidy data.

If you are new to…
ggplot2: Package for Data Visualization. Do Data Visualization with ggplot2 (Part 1), (Part 2), and (Part 3).
dplyr: Package for Data Manipulation. Do Data Manipulation in R with dplyr and Joining Data in R with dplyr.

For \(i=1,\ldots,n\) passengers:
PassengerId
Survived (binary)
Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

Naïve models: Use no information about the passenger, i.e. no covariates:
Survived == 0 for everyone
Survived == 1 for everyone
Survived == 1 with probability \(p\), Survived == 0 with probability \(1-p\)

What about using Sex?

Survived == 1 if Sex == "female", Survived == 0 if Sex == "male"

Chalk talk, i.e. see your lecture notes.
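Each of these models is a one-line prediction rule. A self-contained sketch with a toy Sex vector (on the real data, use train$Sex; the p value is an assumption for the coin-flip model):

```r
# The naive models as prediction rules. Toy Sex vector for illustration:
Sex <- c("female", "male", "male", "female")
n   <- length(Sex)
p   <- 0.5  # assumed survival probability for the coin-flip model

everyone_dies   <- rep(0, n)                       # Survived == 0 for all
everyone_lives  <- rep(1, n)                       # Survived == 1 for all
coin_flip       <- rbinom(n, size = 1, prob = p)   # Survived == 1 w.p. p
females_survive <- ifelse(Sex == "female", 1, 0)   # the Sex-based model
females_survive  # 1 0 0 1
```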
Load CSV's into R:
library(tidyverse)
gender_submission <- readr::read_csv("assets/Titanic/gender_submission.csv")
test <- readr::read_csv("assets/Titanic/test.csv")
train <- readr::read_csv("assets/Titanic/train.csv")
readr:: just indicates the function is from the readr package; use it for disambiguation.

In train, the binary outcome variable Survived is included:
glimpse(train)
## Observations: 891
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived    <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass      <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex         <chr> "male", "female", "female", "female", "male", "mal...
## $ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp       <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch       <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
## $ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, ...
## $ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...
Survived is now NOT included. There are 418 rows (passengers) you need to predict.
glimpse(test)
## Observations: 418
## Variables: 11
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, ...
## $ Pclass      <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2,...
## $ Name        <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Nee...
## $ Sex         <chr> "male", "female", "male", "male", "female", "male"...
## $ Age         <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18...
## $ SibSp       <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0,...
## $ Parch       <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Ticket      <chr> "330911", "363272", "240276", "315154", "3101298",...
## $ Fare        <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7...
## $ Cabin       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B...
## $ Embarked    <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", ...
This is what you submit: 418 rows with
PassengerId
Survived

glimpse(gender_submission)
## Observations: 418
## Variables: 2
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, ...
## $ Survived    <int> 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0,...
You can write to CSV via:
gender_submission %>% readr::write_csv("assets/Titanic/submission.csv")
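For instance, a submission based on the Sex-only model could be built from test the same way. A sketch on a toy stand-in for test (substitute the real test object loaded above, and your own output path):

```r
library(tidyverse)

# Toy stand-in for the Kaggle test set (use the real test object instead):
test <- data_frame(PassengerId = 892:895,
                   Sex = c("male", "female", "male", "male"))

# Build a submission from the Sex-only model and write it out:
submission <- test %>%
  mutate(Survived = ifelse(Sex == "female", 1, 0)) %>%
  dplyr::select(PassengerId, Survived)

submission %>% readr::write_csv("submission.csv")
```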
After submitting to Kaggle, you can see your ranking.
ggplot2, dplyr, modelr

Final Project: Kaggle (rhymes with haggle) competition.